System and method for software code optimization

ABSTRACT

A method is provided of optimizing a software program for a target processor in order to meet specific performance objectives and yet maintain portability, where the software program is initially coded in a high-level language. The method includes a first step of optimizing the software program in the high-level language, using optimizations that are substantially independent of the target processor to host the application. If the performance objectives are met after the completion of this step, then the process preferably terminates successfully. However, if the performance objectives are not met, then the method preferably proceeds to a second step. In the second step, the initially optimized form of the software program is again optimized in the high-level language, although target processor-dependent optimizations are used. If the performance objectives are met after completing this second step, then the process preferably terminates. If the performance objectives are not met, then the process proceeds to a third step. In the third step, the twice-optimized software program is optimized using a low-level language of the target processor on key portions of the code, such that although the software implementation becomes target-dependent, it remains relatively portable.

[0001] This application claims priority to a U.S. Provisional Application entitled “System-on-a-Chip-1,” having Ser. No. 60/216,746 and filed on Jul. 3, 2000, and which is hereby incorporated by reference into this application as though fully set forth herein.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates to the field of software development, and particularly, to systems and methods for software code optimization.

[0004] 2. Background

[0005] In the design of software for digital signal processing (DSP) and other applications, programmers take advantage of the low-level but high-speed capabilities that the particular target processor (e.g., DSP processor, microcontroller) offers in order to achieve the performance requirements for the applications. However, the application of these tools early in the development process leads to the development of a program that may be unportable should a different target processor be subsequently used to host the program. The development of code that is not portable from one target processor to another may result in significant redesign and development costs for the same basic application. Often, if the program is portable, the most significant issue is the cost in time and resources to port the application to a new host, such that the performance requirements of the program on the new host are met.

[0006] A need exists therefore for a system and method that minimizes the likelihood that a development process for software will result in a program that is unportable from one target processor to another.

SUMMARY OF THE INVENTION

[0007] The present invention, in one aspect, provides systems and methods for optimizing software for execution on a specific host processor.

[0008] In one embodiment, a method is provided of optimizing a software program for a target processor in order to meet specific performance objectives, where the software program is coded in a high-level language. The method includes the steps of first optimizing the software program in the high-level language, using optimizations that are substantially independent of the target processor to host the application. If the performance objectives are met after the completion of this step, then the process preferably exits successfully. However, if the performance objectives are not met, then the method preferably proceeds to a second step.

[0009] In the second step, the initially optimized form of the software program is again optimized in the high-level language, although target processor-dependent optimizations are used. If the performance objectives are met after completing this second step, then the process preferably terminates. If the performance objectives are not met, then the process proceeds to a third step.

[0010] In the third step, the twice-optimized software program is optimized using a low-level language of the target processor on key portions of the code, such that although the software implementation becomes target-dependent, it remains relatively portable. Preferably, in evaluating whether the performance objectives have been achieved, performance profiles are determined for the intermediate forms of the optimized software program. These performance profiles are then preferably quantitatively compared to the previously defined performance objectives.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 is a flow diagram generally depicting steps in a process of optimizing a software program for a target processor representing a preferred embodiment of the invention.

[0012] FIG. 2 is a flow diagram depicting a preferred embodiment of a simplified representation of the main steps of the generic implementation process represented as one step in FIG. 1.

[0013] FIG. 3 is a flow diagram depicting a preferred embodiment of detailed steps of the generic implementation process depicted in FIG. 2.

[0014] FIG. 4 is a graph depicting examples of curves of the evolution of a software application with respect to its performance and size in a target-independent optimization process.

[0015] FIG. 5 is a flow diagram depicting a preferred embodiment of a simplified representation of the main steps of the specific implementation process represented as one step in FIG. 1.

[0016] FIG. 6 is a flow diagram depicting a preferred embodiment of detailed steps of the specific implementation process depicted in FIG. 5.

[0017] FIG. 7 is a graph depicting examples of curves of the evolution of a software application with respect to its performance and size in a target-dependent optimization process.

[0018] FIG. 8 is a flow diagram depicting a preferred embodiment of steps of the fully dedicated implementation process represented as one step in FIG. 1.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0019] In a preferred embodiment of a software code optimization method (or process) comprising multiple basic steps, each successive basic step generally results in code that is closer to being dedicated to operate on a particular target. Thus, to promote portability, if the performance goals of the application are reached after the completion of any step in the process, then the optimization process is terminated.

[0020] The evaluation of performance, and thereby, whether the performance meets stated, predefined objectives preferably accounts for several factors. Preferably, a key performance measure is the real-time speed of the application when operating on the specified target. Another performance measure is the accuracy and/or quality of the output. Another factor that may be integrated into the process evaluation is the binary code size. While the application must be made to fit in the target processor's memory, this factor generally becomes less important as memories grow in capacity while becoming smaller, cheaper, and less power-consuming.

[0021] Thus, one major step in the optimization process is to fix the initial constraints that are applied to the development and optimization of the software application. These constraints are preferably used to quantitatively evaluate the application implementation's performance in an overall sense, and facilitate determining the feasibility of porting the application to a specific target at each of the development stages. These measures preferably inherently integrate processing performance characteristics of the target processor, including its clock frequency, which relates to the number of cycles available to execute the application.
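As an illustration of how clock frequency relates to the number of cycles available, the following C sketch computes a per-sample cycle budget and checks a measured cost against it. The function names and the example figures used with them (a 100 MHz clock, an 8 kHz sample rate) are assumptions chosen for illustration and are not taken from this description.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available to process one elementary sample:
 * clock cycles per second divided by samples per second.
 * Hypothetical helper, named for illustration only. */
uint64_t cycles_per_sample(uint64_t clock_hz, uint64_t sample_rate_hz)
{
    assert(sample_rate_hz != 0);
    return clock_hz / sample_rate_hz;
}

/* Returns nonzero when the measured per-sample cost fits the budget. */
int fits_cycle_budget(uint64_t measured_cycles,
                      uint64_t clock_hz, uint64_t sample_rate_hz)
{
    return measured_cycles <= cycles_per_sample(clock_hz, sample_rate_hz);
}
```

With the assumed figures, cycles_per_sample(100000000, 8000) yields 12500 cycles, so an implementation measured at 13000 cycles per sample would not fit.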

[0022] Another set of parameters that preferably are calculated is the global I/O data flow to determine if the memory accesses (read/write) are achievable for the specified target. This set of parameters integrates elements like the aggregated data flow over the internal and external buses (data exchanges).

[0023] FIG. 1 is a flow diagram generally depicting steps in a process of optimizing a software program for a target processor. In the embodiment represented in FIG. 1, an optimization process 100 comprises three optimization steps. In a first optimization step 102, the software for the DSP processor target is written in a high-level language such as C, C++ or Ada. Preferably, the programming language is one that is completely portable between all probable DSP targets. In coding the software, optimization techniques particular to the language are preferably used. The code optimization in this step 102 preferably does not employ any optimization tools that depend on the processor that is meant to host the application.

[0024] Upon completion of this step 102, the new implementation, as a next step 104, is evaluated to determine whether the performance goals have been reached. If by completing the step 102 of target-independent optimization the performance objectives are achieved, then the overall optimization process 100 successfully terminates 112. If the performance requirements for the application have not been achieved, then in a next optimization step 106, certain portions of the software code are re-implemented in the high-level language to take advantage of the specific processing capabilities of the DSP target.

[0025] While the code after this step 106 is less portable than the code that results after the previous step 102, the software may remain partially portable for a number of reasons. One reason is that the modified code is preferably selected from a portion of code that is short in terms of lines of source code, but is repeatedly executed and is thus responsible for a relatively significant percentage of the processing overhead. By first modifying the code that fits these criteria, the amount of code that must be modified is minimized, and may additionally be flagged in the source file to indicate that it is target-specific code. If the target processor later changes, only these identified portions need be addressed for optimization. Further, the previously unoptimized code, corresponding in functionality to the portion of code that is optimized in this step 106, may remain coded in the source file. This original unoptimized code may be used as a starting point for optimizing the same portion of code of any subsequent target processor. Another benefit is that although the coding is specific to the DSP target for the application, the code preferably remains in the high-level language. By remaining in a high-level language (versus being re-coded in a low-level language such as an assembly language), the resulting code is inherently much easier to revisit and comprehend should modifications be necessary.
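One way the flagging and retention described above might look in C is sketched below. The build flag TARGET_DSP_X, the function name, and the choice of unrolling by two are all hypothetical; the description does not prescribe any particular flagging mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit vectors with a 32-bit accumulator.
 * TARGET_DSP_X is a hypothetical build flag. When it is not defined,
 * the original portable loop is compiled, so that loop stays in the
 * source file as the starting point for any later target.
 * The accumulator is assumed not to overflow for the inputs used. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    assert(a != NULL && b != NULL);
#ifdef TARGET_DSP_X
    /* TARGET-SPECIFIC: unrolled by two on the assumption that the
     * target can issue two multiply-accumulates per iteration. */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc += (int32_t)a[i] * b[i];
        acc += (int32_t)a[i + 1] * b[i + 1];
    }
    if (i < n) {
        acc += (int32_t)a[i] * b[i];   /* odd-length tail */
    }
#else
    /* ORIGINAL: generic, target-independent version. */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
#endif
    return acc;
}
```

Both branches compute the same result, so the generic branch can also serve as a reference when validating the target-specific one.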

[0026] Preferably, software-profiling tools are applied to readily identify the portions of code that fit the criteria required to be preferred candidates for optimization, so that they then can be optimized for the particular DSP target as necessary.

[0027] Once a section of code has been optimized for a particular DSP target, other portions of code that meet the criteria to be candidates for optimization may also be optimized for the DSP target, if the performance criteria for the application have still not been achieved.

[0028] If the portions of code that are candidates for optimization have been optimized, then in a next step 108, the implementation is again evaluated to determine whether the performance objectives on the target processor have been met. As in step 104, if the performance objectives are achieved through the step 106 of target-dependent optimization using the high-level language, then the overall optimization process 100 successfully terminates 112. However, if the performance goals of the application have not been met, then the optimization process 100 proceeds to a third optimization step 110. In this step 110, the software is configured to be fully dedicated to the architecture and processing benefits of the target processor. Various coding techniques that are particular to the target processor for the application may be employed. Some of these techniques include executing instructions in parallel or using any pipeline processing or other specialized processing capabilities. Further, tradeoffs may be made between performance and throughput in order to meet pre-stated objectives of the application. The result of the process is an efficiently created program for a DSP, microcontroller, or other computing target processor that meets pre-stated performance objectives, and is optimal, or close to optimal, with respect to its portability, thereby minimizing future software development efforts for the same application.
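The control flow of process 100 (optimize, evaluate, and stop at the first step that meets the objectives) can be sketched in C as follows. This is a minimal illustration only: the type names, the callback signatures, and the toy program model are assumptions introduced here, and real optimization steps are of course not simple callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* The three optimization steps, in order of decreasing portability. */
typedef enum {
    STAGE_GENERIC = 0,   /* target-independent, high-level language */
    STAGE_SPECIFIC,      /* target-dependent, high-level language   */
    STAGE_DEDICATED,     /* fully dedicated, low-level language     */
    STAGE_COUNT
} stage_t;

typedef void (*optimize_fn)(void *program);
typedef int (*goals_met_fn)(const void *program);

/* Applies the stages in order and returns the last stage applied.
 * Stops as soon as the objectives are met, so the program is never
 * made more target-dependent than necessary. */
stage_t run_optimization(void *program,
                         const optimize_fn stages[STAGE_COUNT],
                         goals_met_fn goals_met)
{
    stage_t s = STAGE_GENERIC;
    assert(program != NULL && stages != NULL && goals_met != NULL);
    for (;;) {
        stages[s](program);
        if (goals_met(program) || s == STAGE_DEDICATED) {
            return s;
        }
        s = (stage_t)(s + 1);
    }
}

/* Toy model (hypothetical): a program is a cycle count and a budget,
 * and every stage halves the cycle count. */
typedef struct { long cycles; long budget; } toy_program;

void toy_halve(void *p) { ((toy_program *)p)->cycles /= 2; }

int toy_goals_met(const void *p)
{
    const toy_program *t = (const toy_program *)p;
    return t->cycles <= t->budget;
}
```

In the toy model, a program starting at 1000 cycles with a budget of 300 stops after the second stage, at 250 cycles, without ever reaching the fully dedicated stage.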

[0029] In one embodiment of a system for performing the optimization method, the method is performed automatically after the software code has been initially developed in a high-level language. Preferably, the system is provided the performance parameters that are desired for the application, as well as the architectural specification of the target processor. Given these inputs, the system then processes the high-level language source code, compiles and simulates the code's execution, and tests the code against the specified performance requirements. If the performance requirements are not met, the system profiles the code and then optimizes the portions that are the best candidates for optimization.

[0030] Preferably, the system comprises a software-optimizing processor in conjunction with memory that automatically performs the code profiling operations, code generation operating on portions of code that are determined to be candidates for optimization, and then subsequent performance analysis. The software-optimizing processor may comprise any type of computer, and has processing characteristics dependent upon, for example, the processing requirements for the code generation, profiling and performance assessment operations. It may comprise, e.g., a computer, such as a workstation of the kind manufactured by Sun Microsystems, a mainframe computer, or a personal computer such as the type manufactured by IBM or Apple. A computer executing optimization software is preferably used for the software-optimizing processor, due to the utility and flexibility of a computer in programming, modifying software, and observing software performance. More generally, the software-optimizing processor may be implemented using any type of processor or processors that may perform the code optimization process as described herein.

[0031] Thus, the term “processor,” in its use herein, refers to a wide variety of computational devices or means including, for example, using multiple processors that perform different processing tasks or have the same tasks distributed between processors. The processor(s) may be general purpose CPUs or special purpose processors such as are often conventionally used in digital signal processing systems. Further, multiple processors may be implemented in a server-client or other network configuration, as a pipeline array of processors, etc. Some or all of the processing is alternatively implemented with hard-wired circuitry such as an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other logic device. In conjunction with the term “processor,” the term “memory” refers to any storage medium that is accessible to a processor that meets the memory storage needs for a system for optimizing software. Preferably, the memory buffer is random access memory (RAM) that is directly accessed by the software-optimizing processor for ease in manipulating and processing selected portions of data. Preferably, the memory store comprises a hard disk or other nonvolatile memory device or component.

[0032] Preferred embodiments of the method for performing each of the basic steps illustrated in FIG. 1 are now provided.

Generic Implementation Process (Target-independent Optimization)

[0033] As used herein, the term generic means target-independent. In the DSP domain, in a target-independent implementation, the high-level language source code, normally C code, uses no specific function calls, pragmas, or macros dedicated to the target. With a target-independent implementation, the portability of the application is maintained and some optimization is integrated into the application at a high level, without using assembly language code.

[0034] FIG. 2 is a flow diagram depicting a simplified representation of the main steps of the generic implementation process 200 represented as step 102 in FIG. 1. Preferred detailed steps of this process 200 are depicted in the task flow diagram of FIG. 3. Preferably, as shown in FIG. 2, there are four main steps that take as input the mathematical theory related to a signal processing algorithm and lead to an implementation that is later used by the specific implementation process.

Stage G1: Floating Point Implementation

[0035] The floating-point implementation step takes as input the theoretical solution of a process and transforms the solution into a structured language implementation. A main purpose of the step is to reflect as much of the math in the theory as possible in the implementation.

[0036] Precision in the calculations is important in the floating-point implementation. In general cases, those applications are done using double-precision floats and tools like the Cadence® Cierto™ signal processing worksystem or MATLAB. Such tools provide representation of the processed data, allow graphical representation and comparison, and extract errors so that an implementation can be qualified.

[0037] For an integer DSP, the floating-point implementation transitions to a fixed-point implementation linked to the precision that the DSP can handle. For example, the DSP may need a 16-bit precision implementation. However, typically a group developing the floating-point application is not the group developing the fixed-point implementation. This means that there are at least the following two approaches. In the first approach, the theoretical implementation is made with no consideration of the precision. In that case, the implementation is oriented to processing quality and pushes the precision problem to the fixed-point porting. In the second approach, a target precision is involved at an early stage of the development and impacts the quality of the processing. This provides a full precision-oriented implementation. However, this implementation must be entirely redone if the target architecture is changed. Details regarding floating point formats and related issues in terms of implementation are provided in several articles, including Morgan, Don, “Practical DSP Modeling, Techniques, and Programming in C,” John Wiley & Sons, 1995, pp. 263-298, and Lasley, P., Bier, J., Sholan, A., and Lee, Edward A., “DSP Processors Fundamentals, Architectures and Features,” IEEE Press Series on Signal Processing, 1997, pp. 1-30, which are hereby incorporated by reference as though fully set forth herein.

[0038] Stage G2: Fixed Point Implementation

[0039] In the stage of deriving the fixed-point implementation, trade-offs relating to precision may be made. The extent of these trade-offs primarily depends on the target DSP capability. If the target processor is a 16-bit precision DSP, the accepted deviation of the output result will be greater than on a 32-bit DSP.
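To make the precision trade-off concrete, the following C sketch converts a floating-point value to a 16-bit fixed-point value and back, so that the resulting deviation can be measured. The Q15 format (one sign bit, fifteen fractional bits) and the function names are assumptions for illustration; the description does not name a specific fixed-point format. In Q15 the rounding error of an in-range value is at most 2^-16, whereas a 32-bit format with thirty-one fractional bits would bound it at 2^-32.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a value in [-1.0, 1.0) to Q15, rounding half away from
 * zero and saturating out-of-range inputs. Hypothetical helper. */
int16_t float_to_q15(double x)
{
    double scaled = x * 32768.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;   /* round to nearest */
    if (scaled > 32767.0) {
        scaled = 32767.0;                      /* saturate high */
    }
    if (scaled < -32768.0) {
        scaled = -32768.0;                     /* saturate low */
    }
    return (int16_t)scaled;                    /* truncates toward zero */
}

/* Convert a Q15 value back to floating point. */
double q15_to_float(int16_t q)
{
    return (double)q / 32768.0;
}

/* Deviation introduced by a round trip through Q15. */
double q15_round_trip_error(double x)
{
    assert(x >= -1.0 && x < 1.0);
    return q15_to_float(float_to_q15(x)) - x;
}
```

Comparing q15_round_trip_error against an application-specific tolerance is one simple way to check whether 16-bit precision is acceptable for a given signal.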

[0040] However, depending on the complexity of the algorithm, another factor, the implementation architecture, is preferably considered. If an implementation involves hundreds of function calls, the real-time execution at the end of the implementation flow is impacted. For this reason, two different steps in the implementation activity are utilized.

[0041] Another consideration at this level is the inheritance. A common method of implementing signal processing is to take the floating-point implementation and port it to a specific target. Another method includes porting an existing fixed-point implementation to a new target. The mechanisms are quite different because of the availability of a first implementation. In the latter case, it is more an adaptation of an existing application than a new implementation. The advantage is that it shortens the development process by reusing the existing code done for another target.

Sub-stage G2.1: Processing Qualification

[0042] The goal of processing qualification is to obtain an implementation that preferably provides the best trade-off on the output result for a given precision. One of the tools that can accelerate the completion of this step is the Cierto™ signal processing worksystem. This tool provides the capability to validate, compare, and qualify a process with a reference to the floating-point implementation.

[0043] Fixing the derivation criteria depends primarily on the application category. For image processing comparison, information like texture, edge, contrast and distortion is considered. For voice processing, the same elements may be taken into account, but spectral analysis, tone, volume and saturation, etc. may also be considered. Depending on the application domain, the criteria can be completely different. Furthermore, within a given domain, the criteria can change. Radar can be used in military or agricultural activities but the measures made for those two applications of radar image may be quite different.

Sub-stage G2.2: Implementation Sizing

[0044] With a qualified algorithm in terms of quality and precision, the first sizing of the algorithm can be addressed. Preferably, the information gathered includes the real-time data flow, the implementation structure and architecture, the profiling of instructions and cycles, and the performances of the target DSP. These elements help determine if the code can fit inside the target.

Real-Time Data Flow

[0045] The goal of real-time data flow is to understand the different I/Os related to the algorithm that are to be integrated into the DSP. On one level is the global data flow that globally indicates the availability of the raw and the processed data. With the global data flow, the developer identifies the processing delays that are going to provide a basic characteristic of the application relating to data flow.

[0046] However, with the global data flow that corresponds to a simplified representation of the data flow, the real behavior of the data coming in and out on the data bus of the system is not necessarily clear. The programmer may have to zoom in on the elementary time duration (selected for the global data flow representation) to characterize the behavior of the implementation confronted with the interruptions coming from the devices involved in the process. This “elementary time duration” can be very different from one application to another. It can be the duration of, for example, an image frame, an image line, an audio frame, or a dedicated time dictated by control software or the processor.

[0047] Another data flow consideration is application cadence. Application cadence may impact all future decisions for the application. For example, in an interrupt-driven architecture, which is the case in most real-time-constrained DSP developments, it is then possible to make clear design choices like use of a first-in first-out (FIFO) buffer for the data. This option provides a more flexible way to manage bus I/Os because it allows a better optimization of the bandwidth usage. It is generally a more expensive system design, but it is recommended for processing that involves large amounts of data, like image processing.

[0048] Alternatively, a designer may choose not to use a FIFO. This means that each piece of data produced is either immediately saved or eventually lost. This is the most constrained way of implementing a signal processing application, but is cheaper and well suited for processing that involves little data such as voice processing. This example shows the impact of the application cadence criteria on application development.
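For readers unfamiliar with the FIFO option discussed above, a software FIFO is commonly written as a fixed-size ring buffer. The following C sketch is one such form; the depth of eight samples, the 16-bit sample type, and all names are assumptions for illustration. It is not interrupt-safe as written: sharing it between an interrupt handler and the main loop would require additional protection that depends on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 8u   /* hypothetical depth, in samples */

typedef struct {
    int16_t data[FIFO_CAPACITY];
    size_t head;    /* index of the next slot to write */
    size_t tail;    /* index of the next slot to read  */
    size_t count;   /* number of samples currently stored */
} fifo_t;

void fifo_init(fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->tail = 0;
    f->count = 0;
}

/* Stores one sample. Returns 1 on success, or 0 when the FIFO is
 * full, in which case the sample is not stored (it would be lost). */
int fifo_push(fifo_t *f, int16_t sample)
{
    if (f->count == FIFO_CAPACITY) {
        return 0;
    }
    f->data[f->head] = sample;
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count++;
    return 1;
}

/* Retrieves the oldest sample. Returns 1 on success, 0 when empty. */
int fifo_pop(fifo_t *f, int16_t *out)
{
    if (f->count == 0) {
        return 0;
    }
    *out = f->data[f->tail];
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}
```

The return value of fifo_push makes the trade-off visible: with a buffer, a producer can run ahead of the consumer by up to FIFO_CAPACITY samples before data is lost, whereas without one each sample must be consumed as soon as it is produced.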

[0049] Another factor is bandwidth. Such a study exhaustively integrates external data variables and code fetches. However, a strict and exact representation of the detailed I/Os is generally impossible. All of the representations reflect only the static point of view. A real I/O study preferably indicates that temporal drift impacts the overall bandwidth all along the application processing.

Implementation Organization

[0050] Also affecting implementation sizing is implementation organization. Implementation organization preferably considers the implementation structure, the architecture of the implementation, and the behavior of the implementation.

Implementation Structure

[0051] The implementation structure generally means that the developer knows the number of functions implemented, the number of times they are called, split if possible into low-level and high-level functions, and so on. This first measure can be made manually or by using tools. One difficulty is identifying a tool that indicates the number of times a function is called. Context switching can be expensive if it occurs too many times. For this purpose, one can use free coverage tools like gprof that provide part of the necessary information. The use of other tools like Sparcworks (Sun Microsystems) provides the call graph.
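Where the call counts are gathered manually rather than with a tool, a simple counter incremented at function entry is often sufficient. The C sketch below shows the idea; the COUNT_CALL macro, the counter variable, and the clip function are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instrumentation macro: increments a call counter.
 * It could be defined as empty in production builds. */
#define COUNT_CALL(counter) ((counter)++)

unsigned long calls_clip = 0;   /* number of calls to clip() so far */

/* Low-level function: clamps v to the range [lo, hi]. */
int clip(int v, int lo, int hi)
{
    COUNT_CALL(calls_clip);
    if (v < lo) {
        return lo;
    }
    if (v > hi) {
        return hi;
    }
    return v;
}

/* High-level function: calls clip() once per element. */
void clip_buffer(int *x, size_t n, int lo, int hi)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        x[i] = clip(x[i], lo, hi);
    }
}
```

After processing a buffer of n elements, calls_clip has increased by n, which identifies clip as a short, frequently called function of the kind that is a candidate for in-lining.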

Implementation Architecture

[0052] The architecture of the implementation generally means knowing the overall behavior of the application to know if it may be necessary to revisit the algorithm construction to emphasize real-time issues. Given a specific processing algorithm that produces a signal processing development, the requirements can be formalized as follows:

[0053] 1) Obtain an “elementary” signal sample. This can be an audio value, an entire image, etc.

[0054] 2) Process the sample using the development made.

[0055] The first step in evaluating the feasibility of the application includes determining the global data flow to fix the limit of the input/output size and the precision (8, 16 or 32 bits). The first output of the data flow indicates if it is possible to sustain the I/Os, but another indication concerns the algorithm structure.

[0056] Effectively analyzing the processing flow in conjunction with the data flow can indicate that some steps of the processing cannot be done before others. In that case, it may sometimes be possible to conclude that some delay constraints cannot be met or simply that the algorithm cannot run in real time.

[0057] There are several definitions of the delay. There is the intrinsic delay inherent in any processing, called the real processing delay. In a data process, there is always the processing delay needed to perform the data transformation. But there is also the Architectural Delay (AD) related to the structure of the algorithm. In this case, the delay is tied to the algorithm architecture, which never allows the architectural delay to be reduced.

Application Behavior

[0058] A majority of applications integrate some computations that use static correspondence tables or lookup tables to transform the signal. However, depending on the calculation results, the tables that are used will not be the same. If, for the same computation, the signal processing uses two different conversion tables that have different sizes, then the application is non-deterministic. Thus, this part of an implementation is preferably clearly identified so that all of the steps that follow result in useful measures that can be accurately correlated with the performance increase of the application.

Profiling

[0059] The objective of high-level profiling is to provide a first indication of the number of cycles consumed by the implementation and the binary code size. If necessary, a simulator can also produce an instruction profiling.

[0060] One difficulty is fixing the comparison criteria so that it is known whether the application fits in the targeted DSP. However, it is possible to get a ratio based on the different benchmarks realized on the DSP. The benchmarks are generally provided by the DSP vendors. This means that the cycle counts indicated at a higher level must be correlated with the performances of the target DSP to establish a go/no go process.

[0061] As an example, several DSP providers provide appropriate benchmarks. Several DSPs are compared with C-written kernel functions including MAC: multiply-accumulate; Vect: simple vector multiply; Fir: FIR filter with redundant load elimination; Lat: lattice synthesis; Iir: IIR filter; and Jpeg: JPEG discrete cosine transform.

[0062] From the application point of view, if the programmer takes the average cycle reduction ratio, it is possible to obtain a value of 2.8. From the DSP point of view, one can get a 2.78 gain factor. This is one indicator.

[0063] On one hand, one thing that is not integrated in such benchmarks is the fact that a complete application merges many kinds of functions. This means that the optimization is less efficient for part of the implemented algorithm. Furthermore, the application integrates several function calls that add, in some cases, significant overhead.

[0064] On the other hand, such benchmarks do not assume that the maximum potential of the DSPs is exploited. One preferably measures the gain factor to get effective comparison criteria for the go/no go decision because the developer should go further down in assembly optimization.

[0065] Thus, if the generic C implementation cycle count indicates more than five to six times the number of targeted cycles, the developer may consider that the real-time application is not reachable in a reasonable amount of time.
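The go/no go rule of thumb stated above reduces to a single comparison, sketched here in C. The names are hypothetical, and the threshold is left as a parameter because the description gives a range (five to six) rather than one fixed value.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DECISION_GO, DECISION_NO_GO } decision_t;

/* generic_cycles: cycle count measured for the generic C code.
 * target_cycles:  cycle budget on the target for the same work.
 * max_ratio:      largest acceptable ratio between the two
 *                 (the description suggests five to six).
 * The comparison is done by multiplication to avoid integer-division
 * rounding; the product is assumed not to overflow 64 bits. */
decision_t go_no_go(uint64_t generic_cycles, uint64_t target_cycles,
                    uint64_t max_ratio)
{
    assert(target_cycles != 0);
    return (generic_cycles > max_ratio * target_cycles)
               ? DECISION_NO_GO
               : DECISION_GO;
}
```

For example, with an assumed budget of 12500 cycles and a ratio of five, a generic implementation measured at 50000 cycles is a go, while one measured at 80000 cycles is a no go.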

Stage G3: Reference C Code/Optimization

[0066] This implementation is the reference after generic code qualification to be optimized at the C level. Based on this code, one applies several rules concerning the method of implementing the application that fosters processing time reduction.

[0067] A primary objective is to establish a test process that guarantees the integrity of the processing. The goal is to reduce the cycle consumption and not to transform the result of the processing. It is also possible to establish a specific test script to validate the optimization and/or use tools to compare the processing results.

[0068] The script makes it easier to run several tests and allows the programmer to gather information (traces) on the application behavior. The tools allow the creation of specific comparisons on the processed data. The Cadence Cierto Signal Processing Worksystem (SPW) is capable of such a task and can speed the development cycle.

Stage G3.1: Optimization

[0069] For the high-level language implementation, optimization preferably uses tricks such as loop reduction, loop merging, test reduction, pointer usage, and in-line functions or macros, to reduce context switching. These tricks are generic and can be used for most if not all high-level languages.
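Two of the tricks named above, loop merging and in-line functions, are illustrated in the following C sketch. The function names and the scale-and-offset operation are assumptions chosen for brevity. The first version traverses the buffer twice; the second merges the two loops into one pass and places the per-element arithmetic in an in-line helper so that no call overhead is paid per element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: two separate passes over the buffer.
 * Values are assumed small enough that no overflow occurs. */
void scale_offset_two_pass(int32_t *x, size_t n, int32_t gain, int32_t offset)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] * gain;
    }
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] + offset;
    }
}

/* In-line helper holding the per-element arithmetic. */
static inline int32_t scale_offset(int32_t v, int32_t gain, int32_t offset)
{
    return v * gain + offset;
}

/* After: the two loops are merged into a single pass, halving the
 * loop overhead and the number of traversals of the buffer. */
void scale_offset_merged(int32_t *x, size_t n, int32_t gain, int32_t offset)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        x[i] = scale_offset(x[i], gain, offset);
    }
}
```

Both versions produce identical output for inputs within range, which is the property a test process would check before accepting the merged form.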

[0070] Another optimization step that can be integrated at this level is development chain optimization by addressing the specific options of the pre-processor, compiler, assembler, and/or linker. This may be useful if the implementation is initially done with the target development environment. Generally, the applications are initially developed on PCs or Workstations. Then, taking advantage of the generic compiler is not useful and can lead to bad decisions in terms of performances and code size.

[0071] At the implementation level, the developer assumes that the target DSP is fixed and that a simulator is available. Many C optimizations as are known in the art are possible at this language level.

Stage G3.2: Profiling

[0072] Each time a specific implementation is globally applied and validated, the result is preferably benchmarked, and if possible, fixed to facilitate further optimization. Preferably, at least three parameters are integrated: the global effort in terms of time to integrate a new optimization step, the processing time reduction that can be evaluated, and the code size evolution. These parameters preferably are correlated to the time dedicated to the project and whether or not the application is mandatory to system functionality.

Processing Time Reduction

[0073] In most cases, the gain in processing time follows an x⁻¹ law. In-line functions, loop reduction, and/or unrolling produce significant gain. Integrating pointers is normally less significant. However, a curve like that presented in FIG. 4, which gathers the measurements made for the generic implementation process, may be obtained.

[0074] A goal of these measurements is to understand the impact of a modification. There is no generic rule for reducing the number of cycles that can be applied to all the code and all of the applications. Modifications that appear to optimize cycle count can actually increase it. Another goal is to fix a limit for the different optimization steps in terms of time. One rule, for example, may be to require a measured cycle reduction of more than five percent between two steps.
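
The five percent rule mentioned above could be checked with a small helper such as the following sketch. The function name and parameters are hypothetical, and the cycle counts are assumed to come from a profiler or simulator run.

```c
/* Hypothetical stop criterion: returns 1 if the step from prev_cycles to
 * new_cycles reduced the cycle count by MORE than min_percent, else 0.
 * Assumes the counts are small enough that multiplying by 100 does not
 * overflow unsigned long long. */
int step_is_worthwhile(unsigned long long prev_cycles,
                       unsigned long long new_cycles,
                       unsigned min_percent)
{
    if (new_cycles >= prev_cycles) {
        return 0;   /* no gain, or the modification increased the count */
    }
    /* Integer form of (prev - new) / prev > min_percent / 100. */
    return (prev_cycles - new_cycles) * 100ULL >
           (unsigned long long)min_percent * prev_cycles;
}
```

With a threshold of 5, a step from 1000 to 940 cycles passes, while a step from 1000 to 950 cycles (exactly five percent) does not.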

Code Size Evolution

[0075] This development measure is necessary for embedded applications that do not have several Mbytes of memory available on the final system. Experiments have shown that the generic C optimization process considerably increases the code size. A fully dedicated C optimization process generally decreases the size. However, the programmer preferably guarantees that the code size does not exceed the available memory size of the target system.

Specific Implementation Process (Target-dependent Implementation in the High-level Language)

[0076] In this process, some instructions to allow the use of DSP-specific characteristics preferably are integrated into the C or other high-level language implementation. Many of the instructions may be addressed by using pragma instructions that are placed in the code to take advantage of caches or internal RAM, loop counters, multiply-accumulate (MAC) capabilities, and multiply-subtract (MSU) capabilities. Other specific characteristics like splittable ALUs or multipliers, parallel instruction execution, and pipeline effects are addressed at the assembly level. For some DSPs, the only way to use these characteristics is to handle them at the assembly level. Furthermore, this step requires that the developer perform the least amount of tuning on the code to comply with the DSP's features.

[0077] Although the pragmas and intrinsics tend to detract from the portability, those parts of the code may be encapsulated and isolated. With the use of “#if-define” or other such conditional compiling flags, target compiler dependent flags can be integrated into the code so that it is possible to recompile the same application for all the targets to be addressed. However, this method of implementation requires a clear and structured versioning system as well as clear coding rules. One of the main issues arises from the need to support more than three or four different targets.
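
The encapsulation described above might be sketched as follows. The flag `TARGET_DSP_X` and the intrinsic `_dsp_x_mac()` are hypothetical placeholders for a real target flag and a real compiler intrinsic; only the portable fallback branch is expected to compile on a PC or workstation.

```c
#include <stdint.h>

/* TARGET_DSP_X and _dsp_x_mac() are hypothetical names, used here only to
 * show where a target-specific intrinsic would be isolated. */
#if defined(TARGET_DSP_X)
#define MAC(acc, a, b) _dsp_x_mac((acc), (a), (b))
#else
/* Portable fallback: plain C multiply-accumulate. */
#define MAC(acc, a, b) ((acc) + (int32_t)(a) * (int32_t)(b))
#endif

/* The dot product is written once against the MAC wrapper, so the same
 * source can be recompiled for each target by changing the flag. */
int32_t dot_product(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc = MAC(acc, x[i], y[i]);
    }
    return acc;
}
```

Keeping all target-dependent definitions behind one wrapper of this kind limits the amount of code that must be revisited when a new target is added.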

[0078] Another task of this stage is to implement the high-level language (e.g., C) code and look at the effect obtained on the generated assembler. The goal is not to modify assembly code but to write C code in a way that the assembler part of the compiler generates optimized assembly code. The assumption is made that there are some specific C implementations that will impact the generated assembly code in the same way for many compilers. Examples include the “do {} while” construct and MAC integration. However, this is mainly true for the second and third DSP generations. One can also use the example of the post-register modification. If the developer has converted the implementation to integrate pointers, the position of the pointer increment in the code determines whether or not the post-register modification is generated in the assembly.
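
The two C idioms named above can be combined in one short sketch. Whether a given compiler actually maps them onto a hardware loop or a post-modified load depends on the target and must be confirmed by inspecting the generated assembly; the function name is an assumption.

```c
#include <stdint.h>

/* Assumes n >= 1. The do {} while form tests the count only at the bottom
 * of the loop, a shape that some DSP compilers can map onto a loop counter. */
int32_t sum_of_products(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    do {
        /* The increment is written directly on the dereferenced pointer,
         * which is where a compiler may fold it into a load with
         * post-register modification. */
        acc += (int32_t)(*x++) * (int32_t)(*y++);
    } while (--n > 0);
    return acc;
}
```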

[0079] FIG. 5 is a flow diagram depicting a simplified representation of the main steps of the specific implementation process 500 represented as step 106 in FIG. 1. Detailed steps of the specific implementation process are depicted in the task flow diagram of FIG. 6.

Stage S1: C-Optimized Code Impacting the Assembler

[0080] A key objective is to integrate specific pragmas and intrinsics into the code. The pragmas allow the use of cache or internal RAM memories and integration of loop counters to optimize loop branches. The other aspect of this optimization concerns the implementation modifications that take advantage of the specific capabilities of the target DSP, including multiply-accumulate, multiply-subtract, splittable multiply-add, and post-register modification.

[0081] The goal is to generate the assembly code and observe what can be modified in the C implementation that can be translated differently by the compiler.

Stage S2: Specialized Low-Level Functions

[0082] Depending on the implementation structure, it may be necessary to tune some specific functions that are used intensively. Some methods of accomplishing this include, for example, removing code that is not used, avoiding overhead introduced by recursive calls, moving loop invariant expressions out of the loops, and reducing the scope of the variables (using macros integrates this concept naturally).
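
Moving a loop invariant expression out of a loop can be sketched as follows. The function names and the gain-times-scale workload are illustrative assumptions. Many compilers perform this transformation on their own at higher optimization levels, so the explicit form matters mostly where the compiler cannot prove the expression is invariant.

```c
#include <stddef.h>

/* Before: gain * scale is written inside the loop body and may be
 * recomputed on every iteration. */
void apply_gain_naive(int *x, size_t n, int gain, int scale)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = x[i] * (gain * scale);
    }
}

/* After: the invariant product is computed once, outside the loop, and
 * held in a variable whose scope is limited to this function. */
void apply_gain_hoisted(int *x, size_t n, int gain, int scale)
{
    const int k = gain * scale;
    for (size_t i = 0; i < n; ++i) {
        x[i] *= k;
    }
}
```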

Code Portability

[0083] While the above-mentioned improvements may be viewed as non-portable, they are portable in the sense that the overall architecture of the implementation can be ported and reused. To achieve that, the developer preferably encapsulates the specific instructions of a DSP by integrating specific flags related to the target compiler in the code, and by using a source versioning system to handle the various target DSPs. Note that integrating specific flags can enable specific code parts depending on those flags.

[0084] FIG. 7 is a graph depicting examples of curves of the evolution of a software application with respect to its performance and size in a target-dependent optimization process. As shown in FIG. 7, the size decreases slowly because specific points of the application are addressed, but the impact on the performance can be impressive.

Fully-Dedicated Implementation Process

[0085] A fully dedicated implementation process is the lowest stage of the development process. Trade-offs on the application are preferably made by removing some processing passes, wherever possible. Assembly-specific optimization is also integrated to finally reach the target performance.

[0086] A key challenge is to determine after profiling whether the performance goal is reached. FIG. 8 is a task flow diagram depicting steps of a dedicated implementation process 800 represented as step 108 in FIG. 1. The dedicated implementation process includes two main steps, manual assembly optimization and feature tuning/cutting.

Manual Assembly Optimization

[0087] At this point, very low-level assembly language optimization is integrated. Key characteristics for this implementation generally come from parallel instructions, pipeline effects, and not fully optimized assembly code. Regarding parallel instructions, some DSPs are able to execute several instructions in the same clock cycle. It is possible to execute loads-operation-store in the same instruction. The main objective is to be able to integrate the pipeline effects that affect the availability of the processed data.

[0088] With respect to pipeline effects, it is mainly in the branch call that it is possible to code specific instructions and take advantage of the pipeline delay slots. This optimization can be useful for loop-intensive applications. It is mandatory to handle the parallel instruction optimization.

[0089] For the computationally intensive part of not fully optimized assembly code, it may be necessary to reorganize the generated code and integrate a more optimal way of using accumulators and registers.

Feature Tuning/Cutting

[0090] Depending on the capacities of the used DSP, it may be necessary to re-adapt the application because it does not fit. If such a decision is made, then the high-level steps of the process are not adapted or have been neglected. It is necessary to re-evaluate the application behavior in terms of processing, which is normally a high-level task of the process, for example, floating to fixed-point implementation.
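
As one illustration of what a floating to fixed-point conversion involves, the following sketch uses the common Q15 format (16-bit signed values representing the range from -1.0 up to, but not including, 1.0). The choice of Q15, the function names, and the truncating behavior are assumptions made for this example; the text does not specify a fixed-point format.

```c
#include <stdint.h>

/* Convert a float in [-1.0, 1.0) to Q15, saturating at the range limits. */
int16_t float_to_q15(float v)
{
    float scaled = v * 32768.0f;
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)scaled;   /* conversion truncates toward zero */
}

/* Multiply two Q15 values and return a Q15 result, without rounding. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p /= 32768;                            /* back to Q15, truncating */
    if (p > INT16_MAX) p = INT16_MAX;      /* only -1.0 * -1.0 saturates */
    return (int16_t)p;
}
```

For example, 0.5 converts to 16384, and multiplying 16384 by 16384 yields 8192, which represents 0.25.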

[0091] One work-around is to drop out some specific part of the application that will have little impact on the quality of the processed data. For example, in an audio processing application, functions such as a ring subtraction, a high-pass filter on the input signal, or compression rate could be dropped without a significant loss of performance. Although the gain in terms of performance may not be high, cutting compression rate can suppress enough cycles to reach the target performance.

[0092] Reaching this step of the dedicated implementation process may mean that the application has not been evaluated correctly. If this is the case, then optionally, some of the highest process levels may be re-addressed.

[0093] While preferred embodiments of the invention have been described herein, and are further explained in the accompanying materials, many variations are possible which remain within the concept and scope of the invention. Such variations would become clear to one of ordinary skill in the art after inspection of the specification and the drawings. The invention therefore is not to be restricted except within the spirit and scope of any appended claims.

What is claimed is:
 1. A method of optimizing a software program for a target processor to meet performance objectives, where the software program is coded in a high-level language, the method comprising the steps of: (a) optimizing the software program such that a resulting first optimized form of the software program is substantially independent of the target processor and is substantially coded in the high-level language; (b) optimizing the first optimized form of the software program such that a resulting second optimized form of the software program is substantially dependent on the target processor and is substantially coded in the high-level language; and (c) optimizing the second optimized form of the software program such that a resulting third optimized form of the software program is substantially dependent on the target processor and includes portions coded in a low-level language of the target processor.
 2. The method of claim 1, further comprising steps of: (a1) determining a first performance profile for the first optimized form of the software program, and comparing the first performance profile with the performance objectives; and (b1) determining a second performance profile for the second optimized form of the software program, and comparing the second performance profile with the performance objectives.
 3. The method of claim 2, wherein steps (b), (b1), and (c) are not performed if the performance objectives are met after completing step (a), and step (c) is not performed if the performance objectives are met after completing step (b).